Abstract

Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower 
bounds where previous methods, such as geometric embedding, are not applicable. This has recently 
been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here 
we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, 
obtaining a new class of communication cost lower bounds. These apply, for example, to the algorithms
of Bini et al. (1979) and the algorithms of Hopcroft and Kerr (1971). Some of our bounds are proved to 
be optimal. 



1 Introduction 

The time cost of an algorithm, sequential or parallel, depends not only on how many computational 
operations it executes but also on how much data it moves. In fact, the cost of data movement, or 
communication, is often much more expensive than the cost of computation. Architectural trends predict 
that computation cost will continue to decrease exponentially faster than communication cost, leading 
to ever more algorithms that are dominated by the communication costs. Thus, in order to minimize 
running times, algorithms should be designed with careful consideration of their communication costs. To 
that end, we discuss asymptotic costs of algorithms in terms of both number of computations performed 
(flops in the case of numerical algorithms) and units of communication: words moved. 

For a sequential algorithm, we determine the communication cost incurred on a simple machine model
which consists of two levels of memory hierarchy, as described in Section 1.3. In many cases, naive
implementations of algorithms incur communication costs much higher than necessary; reformulating the
algorithm to perform the same arithmetic in a different order can drastically decrease the
communication costs and therefore the total running time. In order to determine the possible improvements and
identify whether an algorithm is optimal with respect to communication costs, one seeks communication
lower bounds.

Hong and Kung [17] were the first to prove communication lower bounds for matrix multiplication
algorithms. They show that on a two-level machine model, any algorithm which performs the $\Theta(n^3)$ flops
of classical matrix multiplication must move at least $\Omega(n^3/\sqrt{M})$ words between fast and slow memory,
where $M$ is the number of words that can fit simultaneously in fast memory. Irony, Toledo, and Tiskin [22]
generalized their classical matrix multiplication result to a distributed-memory parallel machine model
using a geometric embedding argument. Ballard, Demmel, Holtz, and Schwartz [4] showed this proof
technique is applicable to a more general set of computations, including one-sided matrix factorizations
such as LU, Cholesky, and QR and two-sided matrix factorizations which are used in eigenvalue and
singular value computations, most of which perform $O(n^3)$ computations in the dense matrix case. Many
of these bounds on $O(n^3)$ algorithms have been shown to be optimal.

However, the geometric embedding approach does not seem to apply to computations which do
not map to a simple geometric computation space. In the case of classical matrix multiplication and
other $O(n^3)$ algorithms, the computation corresponds to a three-dimensional lattice. In particular, the
geometric embedding approach does not readily apply to Strassen's algorithm for matrix multiplication
that requires $O(n^{\log_2 7})$ flops. Instead, Ballard, Demmel, Holtz, and Schwartz [5] show that a different
proof technique based on analysis of the expansion properties of the computational directed acyclic graph
(CDAG) can be used to obtain communication lower bounds for both sequential and parallel models for
these algorithms. The proof technique can also be used to bound how well the corresponding parallel
algorithms can strongly scale [2]. We use this same approach here to prove bounds on fast rectangular
matrix multiplication algorithms, which introduce some extra technical challenges.

1.1 Expansion and communication 

The CDAG of a recursive algorithm has a recursive structure, and thus its expansion can be analyzed 
combinatorially (similarly to what is done for expander graphs in [30l[TJ[26j) or by spectral analysis (in 
the spirit of what was done for the Zig-Zag expanders [31]). Analyzing the CDAG for communication 
cost bounds was first suggested by Hong and Kung [17]. They use the red-blue pebble game to obtain
tight lower bounds on the communication costs of many algorithms, including classical $O(n^3)$ matrix
multiplication, matrix-vector multiplication, and FFT. Their proof is obtained by considering dominator
sets of the CDAG. 

Other papers study connections between bounded space computation and combinatorial expansion- 
related properties of the corresponding CDAG (see e.g., [321 El IB] and references therein). The study of 
expansion properties of a CDAG was also suggested as one of the main motivations of Lev and Valiant 
[28] in their work on superconcentrators and lower bounds on the arithmetic complexity of various 
problems. 

1.2 Fast rectangular matrix multiplication 

Following Strassen's algorithm for fast multiplication of square matrices [33], the arithmetic complexity of
multiplying rectangular matrices has been extensively studied (see [IH1 [H] H21 HH1 HOI [2D E] and further
details in .12!). When there is an algorithm for multiplying an $m \times n$ matrix $A$ with an $n \times p$ matrix
$B$ to obtain an $m \times p$ matrix $C$ using only $q$ scalar multiplications, we use the notation $\langle m,n,p \rangle = q$.$^1$
The above studies try to minimize the number of multiplications $q$ (as a function of $m$, $n$, and $p$). A
particular focus of interest is maximizing $\alpha$ so that $\langle n,n,n^{\alpha} \rangle = O(n^2 \log n)$, namely maximizing the size
of a rectangular matrix, so that it can be multiplied (from right) with a square matrix, in time which is
only slightly more than what is needed to read the input.$^2$ Recall that $\langle m,n,p \rangle = \langle n,p,m \rangle = \langle p,m,n \rangle =
\langle m,p,n \rangle = \langle p,n,m \rangle = \langle n,m,p \rangle$ for all $m,n,p$ [18].
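As a small illustration of this notation, the following Python sketch computes the shape and multiplication count of a tensor power and lists the reorderings of a base shape that share the same $q$. It is not taken from the paper: the helper names `tensor_power` and `symmetric_shapes` are hypothetical, and the example shape $\langle 3,2,2 \rangle = 10$ is the one quoted in the caption of Figure 2.

```python
from itertools import permutations

def tensor_power(m, n, p, q, t):
    """Shape and multiplication count of the t-th tensor power of <m,n,p> = q."""
    return m**t, n**t, p**t, q**t

def symmetric_shapes(m, n, p):
    """All distinct orderings of (m, n, p); by the symmetry recalled above,
    each of them can be multiplied with the same number q of multiplications."""
    return sorted(set(permutations((m, n, p))))

# Hypothetical usage with the base shape <3,2,2> = 10.
print(tensor_power(3, 2, 2, 10, 3))
print(symmetric_shapes(3, 2, 2))
```

The sketch only tracks dimensions and counts; it does not implement any multiplication algorithm.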

Rectangular matrix multiplication is used in many algorithms, for solving problems in linear algebra, 
in combinatorial optimization, and other areas. Utilizing fast algorithms for rectangular matrix multi- 
plication has proved to be quite useful for improving the complexity of solving many of those problems 
(a very partial list includes [111 [3 EH1 [23 Ell 1311 El]).

1 Recall that $\langle m,n,p \rangle = q$ implies that for all integers $t$, $\langle m^t,n^t,p^t \rangle = q^t$ by recursion (tensor powering), and also that the
arithmetic complexity of $\langle m^t,n^t,p^t \rangle$ is $O(q^t)$ regardless of the number of additions in $\langle m,n,p \rangle$.

2 Note that our approach may not apply to algorithms of the form $\langle n,n,n^{\alpha} \rangle = O(n^2 \log n)$. It only applies to algorithms
that are a recursive application of a base-case algorithm.

1.3 Communication model



We model communication costs on a sequential machine as follows. Assume the machine has a fast 
memory of size M words and a slow memory of infinite size. Further assume that computation can be 
performed only on data stored in the fast memory. On a real computer, this model may have several
interpretations and may be applied anywhere in the memory hierarchy. For example, the slow memory
might be the hard drive and the fast memory the DRAM; or the slow memory might be the DRAM and 
the fast memory the cache. 

The goal is to minimize the number of words W transferred between fast and slow memory, which 
we call the communication cost of an algorithm. Note that we minimize with respect to an algorithm, 
not with respect to a problem, and so the only optimization allowed is re-ordering the computation in 
a way that is consistent with the CDAG of the algorithm. The sequential communication cost is closely 
related to communication costs in the various parallel models. We discuss this relationship briefly in
Section 6.



1.4 The communication costs of rectangular matrix multiplication 

The communication cost lower bounds of rectangular matrix multiplication algorithms are determined
by properties of the underlying CDAGs. Consider $\langle m^t,n^t,p^t \rangle = q^t$ matrix multiplication that is
generated from $t$ tensor powers of $\langle m,n,p \rangle = q$. We call the former the algorithm and the latter the
base case, and consider their CDAGs. They both consist of four parts: the encoding graphs of $A$ and $B$,
the scalar multiplications, and the decoding graph of $C$. The encoding graphs correspond to computing
linear combinations of entries of $A$ or $B$, and the decoding graph to computing linear combinations of
the scalar products. See Figure 1 in Section 4 for a diagram of the algorithm CDAG, and Figure 2 in
Section 5 for an example of a base-case CDAG. Let us state the communication cost lower bounds of
the two main cases.

Theorem 1 Let $\langle m^t,n^t,p^t \rangle = q^t$ be the algorithm obtained from a base case $\langle m,n,p \rangle = q$. If the
decoding graph of the base case is connected, then the communication cost lower bound is

$$W = \Omega\left(\frac{q^t}{M^{\log_{mp} q - 1}}\right).$$

Further, in the case that $n \le m$ and $n \le p$ this bound is tight.

Note that in the case $m = n = p$, this result reproduces the lower bound for Strassen-like square
matrix multiplication algorithms in [5]. In this case, for $\omega_0 = \log_n q$, we obtain $W = \Omega\left(\left(\frac{n^t}{\sqrt{M}}\right)^{\omega_0} \cdot M\right)$.
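The reduction to the square case can be checked numerically. The sketch below evaluates the Theorem 1 expression $q^t / M^{\log_{mp} q - 1}$ and the Strassen-like form $(n^t/\sqrt{M})^{\omega_0} \cdot M$ with the hidden constants dropped; the function names are hypothetical, and the parameter values ($n = 2$, $q = 7$, $t = 10$, $M = 1024$) are chosen only for illustration.

```python
import math

def theorem1_expression(m, n, p, q, t, M):
    """q^t / M^(log_{mp} q - 1): the Theorem 1 bound without its hidden constant."""
    return q**t / M ** (math.log(q, m * p) - 1)

def square_case_expression(n, q, t, M):
    """((n^t / sqrt(M))^omega0) * M with omega0 = log_n q, for m = n = p."""
    omega0 = math.log(q, n)
    return (n**t / math.sqrt(M)) ** omega0 * M

a = theorem1_expression(2, 2, 2, 7, 10, 1024)
b = square_case_expression(2, 7, 10, 1024)
# The two expressions agree up to floating-point rounding.
print(math.isclose(a, b, rel_tol=1e-9))
```

Whether the hypothesis of Theorem 1 (a connected base-case decoding graph) holds for a given algorithm has to be checked separately; the code only evaluates the formula.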

Theorem 2 Let $\langle m^t,n^t,p^t \rangle = q^t$ be the algorithm obtained from a base case $\langle m,n,p \rangle = q$. If an
encoding graph of the base case is connected and has no multiply-copied inputs,$^3$ then

$$W = \Omega\left(\frac{q^t}{t^{\log_N q} \, M^{\log_N q - 1}}\right),$$

where $N = mn$ or $N = np$ is the size of the input to the encoding graph. Further, this bound is tight if
$N = \max\{mn, np, mp\}$, up to a factor of $t^{\log_N q}$, which is a polylogarithmic factor in the input size.
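To get a feel for the size of the polylogarithmic factor, the following sketch evaluates the Theorem 2 expression $q^t / (t^{\log_N q} M^{\log_N q - 1})$ with the hidden constant dropped. The function name and the toy parameters are assumptions made for illustration; they do not correspond to any particular algorithm discussed in the paper.

```python
import math

def theorem2_expression(N, q, t, M):
    """q^t / (t^(log_N q) * M^(log_N q - 1)), without the hidden constant.

    N is the base-case input size of the encoding graph (mn or np)."""
    e = math.log(q, N)
    return q**t / (t**e * M ** (e - 1))

# Toy parameters: N = 4, q = 16 gives log_N q = 2, so the value is
# 16^2 / (2^2 * 4^1) = 16.
print(theorem2_expression(4, 16, 2, 4))
```

Compared with the expression in Theorem 1, the only structural difference is the extra $t^{\log_N q}$ in the denominator and the use of $N$ in place of $mp$.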

We also treat the cases of disconnected encoding and decoding graphs and obtain similar bounds with
restrictions on the fast memory size $M$. See Corollaries 13 and 14 in Section 4.3.

These theorems and corollaries apply in particular to the algorithms of Bini et al. [TT] and Hopcroft
and Kerr [TH], which we detail in Section 5.

1.5 Paper organization 

In Section 2 we state some preliminary facts about the computational graph and edge expansion.
Section 3 explains the connection between communication cost and edge expansion. The proofs of the lower
bound theorems stated in Section 1.4, as well as some extensions, appear in Section 4. In Section 5
we apply our new lower bounds to two example algorithms: Bini's algorithm and the Hopcroft-Kerr
algorithm. Appendix A gives further details of Bini's algorithm and the Hopcroft-Kerr algorithm.

3 See Section 2 for a formal definition.

2 Preliminaries



2.1 The Computational Graph 

For a given algorithm, we consider the CDAG G = (V,E), where there is a vertex for each arithmetic 
operation (AO) performed, and for every input element. G contains a directed edge (u,v), if the output 
operand of the AO corresponding to u (or the input element corresponding to u) , is an input operand to 
the AO corresponding to v. The in-degree of any vertex of G is, therefore, at most 2 (as the arithmetic 
operations are binary). The out-degree is, in general, unbounded, i.e., it may be a function of $|V|$.
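The definition can be made concrete with a toy example. The sketch below is an illustration under assumptions, not the paper's data structure: the class name `CDAG` and its methods are hypothetical. It builds the CDAG of the inner product $a_1 b_1 + a_2 b_2$, with one vertex per input element and per binary operation, and an edge $(u, v)$ whenever the value at $u$ is an operand of $v$.

```python
from collections import Counter

class CDAG:
    """Toy computational DAG: one vertex per input element or binary operation."""

    def __init__(self):
        # vertex name -> tuple of its operand vertices (empty for inputs)
        self.operands = {}

    def add_input(self, name):
        self.operands[name] = ()
        return name

    def add_op(self, name, left, right):
        # Edge (u, v) exists when the value of u is an operand of operation v.
        self.operands[name] = (left, right)
        return name

    def edges(self):
        return [(u, v) for v, ops in self.operands.items() for u in ops]

    def max_in_degree(self):
        return max(len(ops) for ops in self.operands.values())

    def out_degrees(self):
        return Counter(u for u, _ in self.edges())

g = CDAG()
a1, a2, b1, b2 = (g.add_input(x) for x in ("a1", "a2", "b1", "b2"))
m1 = g.add_op("m1", a1, b1)   # a1 * b1
m2 = g.add_op("m2", a2, b2)   # a2 * b2
s = g.add_op("s", m1, m2)     # m1 + m2
print(len(g.operands), len(g.edges()), g.max_in_degree())
```

Because every operation is binary, the in-degree never exceeds 2, while nothing in the construction limits how often a value is reused as an operand.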



2.1.1 The relaxed computational graph. 

For a given recursive algorithm, the relaxed computational graph is almost identical to the computational 
DAG with the following change: when a vertex corresponds to re-using data across recursive levels, we 
replace it with several connected "copy vertices," each of which exists in one recursive level. While the 
CDAG of a recursive algorithm may have vertices of degree that depend on $|V|$, this relaxed CDAG has
constant bounded degree. We use the relaxed graph to handle such cases in Section 4.2.



2.1.2 Multiply-copied vertices. 

We say that a base-case encoding subgraph has no multiply-copied vertices if each input vertex appears 
at most once as an output vertex. An output vertex v is copied from an input vertex if the in-degree 
of $v$ is exactly one. See, for example, Figure 2: the vertex $a_{11}$ is copied to the third output of $Enc_1 A$
but is not copied to any other outputs. Since all other inputs are also copied at most once, there are no
multiply-copied vertices in Figure 2.

This condition is necessary for the degree of the entire algorithm's encoding subgraph to be at most 
logarithmic in the size of the input. We are not aware of any fast matrix multiplication algorithm that 
has multiply-copied vertices, although the recursive formulation of classical matrix multiplication does. 

2.2 Edge expansion 

The edge expansion $h(G)$ of a $d$-regular undirected graph $G = (V, E)$ is:

$$h(G) = \min_{U \subseteq V,\ |U| \le |V|/2} \frac{|E(U, V \setminus U)|}{d \cdot |U|},$$

where $E(A, B) = E_G(A, B)$ is the set of edges connecting the vertex sets $A$ and $B$. We omit the subscript
$G$ when the context makes it clear. Treating a CDAG as undirected simplifies the analysis and does not
affect the asymptotic communication cost. For many graphs, small sets expand more than larger sets.
Let $h_s(G)$ denote the edge expansion for sets of size at most $s$ in $G$:

$$h_s(G) = \min_{U \subseteq V,\ |U| \le s} \frac{|E(U, V \setminus U)|}{d \cdot |U|}.$$

Note that CDAGs are typically not regular. If a graph $G = (V, E)$ is not regular but has a bounded
maximal degree $d$, then we can add ($< d$) loops to vertices of degree $< d$, obtaining a regular graph $G'$.
We use the convention that a loop adds 1 to the degree of a vertex. Note that for any $S \subseteq V$, we have
$|E_G(S, V \setminus S)| = |E_{G'}(S, V \setminus S)|$, as none of the added loops contributes to the edge expansion of $G'$.
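For very small graphs, $h_s(G)$ can be computed by brute force directly from the definition. The sketch below is such a check, assuming an undirected graph given as an edge list; the function name `small_set_expansion` is hypothetical. It takes $d$ to be the maximum degree, which matches the loop-padding convention above because added loops never cross a cut. The enumeration is exponential in $s$ and is meant only to illustrate the definition.

```python
from itertools import combinations

def small_set_expansion(vertices, edges, s):
    """Brute-force h_s(G): minimum of |E(U, V \\ U)| / (d * |U|) over 1 <= |U| <= s."""
    vertices = list(vertices)
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    d = max(degree.values())  # loops added for regularity do not change any cut
    best = float("inf")
    for size in range(1, s + 1):
        for subset in combinations(vertices, size):
            U = set(subset)
            cut = sum((u in U) != (v in U) for u, v in edges)
            best = min(best, cut / (d * size))
    return best

# A 4-cycle: single vertices expand better than pairs of adjacent vertices.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(small_set_expansion(range(4), cycle, 1))
print(small_set_expansion(range(4), cycle, 2))
```

On the 4-cycle the value for $s = 1$ is larger than for $s = 2$, a tiny instance of the remark that small sets expand more than larger sets.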



2.3 Matching sequential algorithm 

In many cases, the communication cost lower bounds are matched by the naive recursive algorithm. The
cost of the recursive algorithm applied to $\langle m^t,n^t,p^t \rangle = q^t$, taking $N^* = \max\{mn, np, mp\}$, is

$$W(t) = \begin{cases} q \cdot W(t-1) + O\left((N^*)^{t-1}\right) & \text{if } (N^*)^t > M/3 \\ 3 (N^*)^t & \text{otherwise} \end{cases}$$

since the algorithm does not communicate once the three matrices fit into fast memory. The solution to
this recurrence is given by

$$W = \Theta\left(\frac{q^t}{M^{\log_{N^*} q - 1}}\right).$$
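The recurrence for the naive recursive algorithm, $W(t) = q \cdot W(t-1) + O((N^*)^{t-1})$ above the threshold $(N^*)^t > M/3$ and $W(t) = 3(N^*)^t$ below it, can be evaluated numerically and compared with $q^t / M^{\log_{N^*} q - 1}$. In the sketch below the constant `c` hidden in the $O(\cdot)$ term is an assumption (set to 1), the function names are hypothetical, and the parameters $q = 7$, $N^* = 4$ are the Strassen-like values used only as an example.

```python
import math

def recursive_cost(q, N, t, M, c=1.0):
    """Evaluate W(t) = q*W(t-1) + c*N^(t-1) if N^t > M/3, else 3*N^t.

    c stands in for the unspecified constant of the O(.) term (an assumption)."""
    if N**t <= M / 3:
        return 3 * N**t
    return q * recursive_cost(q, N, t - 1, M, c) + c * N ** (t - 1)

def closed_form(q, N, t, M):
    """q^t / M^(log_N q - 1), the claimed solution up to a constant factor."""
    return q**t / M ** (math.log(q, N) - 1)

q, N, M = 7, 4, 192  # with M = 3 * 4^3 the recursion bottoms out at t = 3
ratios = [recursive_cost(q, N, t, M) / closed_form(q, N, t, M) for t in range(4, 12)]
print(min(ratios), max(ratios))
```

For these parameters the ratio stays within a narrow constant band as $t$ grows, which is what a $\Theta(\cdot)$ solution predicts; the band itself depends on the assumed constant `c`.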
3 Communication Cost and Edge Expansion



In this section we recall the partition argument and how to combine it with edge expansion analysis 
to obtain communication cost lower bounds. This follows our approach in 012]. A similar partition 
argument previously appeared in [17], [22], [4], where other techniques (geometric or combinatorial) are
used to connect the number of flops to the amount of data in a segment. 

3.1 The partition argument 

Let $M$ be the size of the fast memory. Let $O$ be any total ordering of the vertices that respects the
partial ordering of the CDAG $G$. This total ordering can be thought of as the actual order in which the
computations are performed. Let $\mathcal{P}$ be any partition of $V$ into segments $S_1, S_2, \ldots$, so that a segment
$S_i \in \mathcal{P}$ is a subset of the vertices that are contiguous in the total ordering $O$.

Let $R_S$ and $W_S$ be the set of read and write operands, respectively. Namely, $R_S$ is the set of vertices
outside $S$ that have an edge going into $S$, and $W_S$ is the set of vertices in $S$ that have an edge going
outside of $S$. Then the total communication costs due to reads of AOs in $S$ is at least $|R_S| - M$, as at
most $M$ of the needed $|R_S|$ operands are already in fast memory when the execution of the segment's
AOs starts. Similarly, $S$ causes at least $|W_S| - M$ actual write operations, as at most $M$ of the operands
needed by other segments are left in the fast memory when the execution of the segment's AOs ends.
The total communication cost is therefore bounded below by

$$W \ge \max_{\mathcal{P}} \sum_{S \in \mathcal{P}} \left( |R_S| + |W_S| - 2M \right). \qquad (1)$$
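The read and write operand sets of a segment are easy to compute for a concrete ordering. The sketch below does this for the toy CDAG of $a_1 b_1 + a_2 b_2$ and sums $|R_S| + |W_S| - 2M$ over equal-size segments of one fixed ordering. It is an illustration under assumptions: the function names, the edge list, and the chosen ordering are hypothetical, and for a single fixed partition the sum may be negative, in which case it gives no information.

```python
def read_write_sets(edges, segment):
    """R_S: vertices outside S with an edge into S; W_S: vertices in S with an edge leaving S."""
    S = set(segment)
    R = {u for u, v in edges if u not in S and v in S}
    W = {u for u, v in edges if u in S and v not in S}
    return R, W

def partition_sum(edges, order, seg_size, M):
    """Sum of |R_S| + |W_S| - 2M over contiguous segments of the given total order."""
    total = 0
    for start in range(0, len(order), seg_size):
        R, W = read_write_sets(edges, order[start:start + seg_size])
        total += len(R) + len(W) - 2 * M
    return total

# Toy CDAG of a1*b1 + a2*b2 and one ordering that respects its dependencies.
edges = [("a1", "m1"), ("b1", "m1"), ("a2", "m2"), ("b2", "m2"), ("m1", "s"), ("m2", "s")]
order = ["a1", "b1", "a2", "b2", "m1", "m2", "s"]
R, W = read_write_sets(edges, ["m1", "m2", "s"])
print(sorted(R), sorted(W))
print(partition_sum(edges, order, 4, 1))
```

The segment containing the three operations reads all four inputs and writes nothing that another segment needs, so its contribution is $4 - 2M$.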

3.2 Edge expansion and communication cost 

Consider a segment $S$ and its read and write operands $R_S$ and $W_S$.

Proposition 3 If the graph $G$ containing $S$ has $h_s(G)$ edge expansion$^4$ for sets of size $s = |S|$, maximum
(constant) degree $d$, and at least $2|S|$ vertices, then $|R_S| + |W_S| \ge \frac{1}{2} \cdot h_s(G) \cdot |S|$.

Proof We have $|E(S, V \setminus S)| \ge h_s(G) \cdot d \cdot |S|$. Either (at least) half of the edges $E(S, V \setminus S)$ touch $R_S$
or half of them touch $W_S$. As every vertex is of degree $d$, we have $|R_S| + |W_S| \ge \max\{|R_S|, |W_S|\} \ge
\frac{1}{d} \cdot \frac{1}{2} \cdot |E(S, V \setminus S)| \ge h_s(G) \cdot |S| / 2$. □

Combining this with (1) and choosing to partition $V$ into $|V|/s$ segments of equal size $s$, we obtain:
$W \ge \max_s \frac{|V|}{s} \cdot \left( \frac{h_s(G) \cdot s}{2} - 2M \right)$. Choosing the minimal $s$ so that

$$\frac{h_s(G) \cdot s}{2} \ge 3M \qquad (2)$$

we obtain

$$W \ge \frac{|V|}{s} \cdot M. \qquad (3)$$

In some cases, as in fast square and rectangular matrix multiplication, the computational graph G 
does not fit this analysis: it may not be regular, it may have vertices of unbounded degree, or its edge 
expansion may be hard to analyze. In such cases, we may then consider some subgraph G' of G instead 
to obtain a lower bound on the communication cost. The natural subgraph to select in fast (square and 
rectangular) matrix multiplication algorithms is the decoding graph or one of the two encoding graphs. 



4 Expansion Properties of Fast Rectangular Matrix Multiplication Algorithms

There are several technical challenges that we deal with in the rectangular case, on top of the analysis
in [5] (where we deal with the difference between addition and multiplication vertices in the recursive
construction of the CDAG). These additional challenges arise from the differences between the CDAG
of rectangular algorithms, such as Bini's algorithm and the Hopcroft-Kerr algorithm on the one hand,
and of Strassen's algorithm on the other hand. The three subgraphs, two encoding and one decoding,
are of the same size in Strassen's and of unequal size in rectangular algorithms. The largest expansion
guarantee is given by the subgraph corresponding to the largest of the three matrices. One consequence
is that it is necessary to consider the case of unbounded degree vertices that may appear in the encoding
subgraphs. Additionally, in some cases the encoding or decoding graphs consist of several disconnected
components.

Figure 1: Computational graph for $\langle m^t,n^t,p^t \rangle = q^t$ rectangular matrix multiplication generated from $t$
recursive levels with base graph given by $\langle m,n,p \rangle = q$. In this figure $m < p < n$.

4 For many algorithms, the edge expansion $h(G)$ deteriorates with $|G|$, whereas $h_s(G)$ is constant with respect to $|G|$, which
allows for better communication lower bounds.



4.1 The computational graph for $\langle m^t,n^t,p^t \rangle = q^t$

Consider the computational graph $H_t$ associated with multiplying a matrix $A$ of dimension $m^t \times n^t$ by
a matrix $B$ of dimension $n^t \times p^t$. Denote by $Enc_t A$ the part of $H_t$ that corresponds to the encoding of
matrix $A$. Similarly, $Enc_t B$ and $Dec_t C$ correspond to the parts of $H_t$ that compute the encoding of $B$
and the decoding of $C$, respectively (see Figure 1).

4.1.1 A top-down construction of the computational graph. 

We next construct the computational graph $H_{i+1}$ by constructing $Dec_{i+1} C$ from $Dec_i C$ and $Dec_1 C$ and
similarly constructing $Enc_{i+1} A$ and $Enc_{i+1} B$, then composing the three parts together.

1. Duplicate $Dec_1 C$ $q^i$ times.

2. Duplicate $Dec_i C$ $mp$ times.

3. Identify the $mp \cdot q^i$ output vertices of the copies of $Dec_1 C$ with the $mp \cdot q^i$ input vertices of the
copies of $Dec_i C$:

• Recall that each $Dec_1 C$ has $mp$ output vertices.

• The first output vertex of the $q^i$ $Dec_1 C$ graphs are identified with the $q^i$ input vertices of the
first copy of $Dec_i C$.

• The second output vertex of the $q^i$ $Dec_1 C$ graphs are identified with the $q^i$ input vertices of
the second copy of $Dec_i C$. And so on.

• We make sure that the $j$th input vertex of a copy of $Dec_i C$ is identified with an output vertex
of the $j$th copy of $Dec_1 C$.

4. We similarly obtain $Enc_{i+1} A$ from $Enc_i A$ and $Enc_1 A$,

5. and $Enc_{i+1} B$ from $Enc_i B$ and $Enc_1 B$.

6. For every $i$, $H_i$ is obtained by connecting edges from the $j$th output vertices of $Enc_i A$ and $Enc_i B$
to the $j$th input vertex of $Dec_i C$.
This completes the construction. Let us note some properties of these graphs.
As all out-degrees are at most $mp$ and all in-degrees are at most 2 we have:

Proposition 4 All vertices of $Dec_t C$ are of degree at most $mp + 2$, as long as $n > 1$ (that is, as long
as the base case is not an outer product).

Proof If the set of input vertices of $Dec_1 C$ and the set of its output vertices are disjoint, then the
proposition follows. Assume (towards contradiction) that the base graph $Dec_1 C$ has an input vertex
which is also an output vertex. An output vertex represents the inner product of two $n$-vectors, i.e.,
the corresponding row vector of $A$ and column vector of $B$. Since $n > 1$, the corresponding bilinear
polynomial is irreducible. This is a contradiction, since an input vertex represents the multiplication of a
(weighted) sum of elements of $A$ with a (weighted) sum of elements of $B$. □

Note, however, that $Enc_1 A$ and $Enc_1 B$ may have vertices which are both inputs and outputs,
therefore $Enc_t A$ and $Enc_t B$ may have vertices of out-degree which is a function of $t$. In [5], it
was enough to analyze $Dec_t C$ and lose only a constant factor in the lower bound. However, in several
rectangular matrix multiplication algorithms, it is necessary to consider the encoding graphs as well,
since they may provide a better expansion than the decoding graph.
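The top-down construction of Section 4.1.1 (steps 1 to 3, for the decoding graph) can be sketched in code. The version below is a simplified model under stated assumptions: the base graph $Dec_1 C$ is treated as a plain bipartite graph from its $q$ inputs to its $mp$ outputs, without the intermediate binary-addition vertices of a real CDAG; the function name `build_dec` is hypothetical; and the complete bipartite base graph used in the example is a placeholder, not the decoding graph of any algorithm in the paper.

```python
from itertools import count

def build_dec(base_edges, q, mp, t, ids=None):
    """Return (inputs, outputs, edges) of a simplified Dec_t C.

    base_edges: pairs (j, k) meaning base input j (0..q-1) feeds base output k (0..mp-1).
    """
    ids = ids if ids is not None else count()
    if t == 1:
        ins = [next(ids) for _ in range(q)]
        outs = [next(ids) for _ in range(mp)]
        return ins, outs, [(ins[j], outs[k]) for j, k in base_edges]
    # Steps 1-2: q^(t-1) copies of Dec_1 C and mp copies of Dec_(t-1) C.
    small = [build_dec(base_edges, q, mp, 1, ids) for _ in range(q ** (t - 1))]
    big = [build_dec(base_edges, q, mp, t - 1, ids) for _ in range(mp)]
    # Step 3: output k of the c-th Dec_1 C copy becomes input c of the k-th Dec_(t-1) C copy.
    rename = {}
    for c, (_, outs1, _) in enumerate(small):
        for k in range(mp):
            rename[outs1[k]] = big[k][0][c]
    edges = [(u, rename[v]) for _, _, es in small for u, v in es]
    edges += [e for _, _, es in big for e in es]
    inputs = [v for ins1, _, _ in small for v in ins1]
    outputs = [v for _, outs_b, _ in big for v in outs_b]
    return inputs, outputs, edges

# Placeholder base graph: q = 3 inputs, mp = 2 outputs, complete bipartite.
q, mp, t = 3, 2, 3
base = [(j, k) for j in range(q) for k in range(mp)]
inputs, outputs, edges = build_dec(base, q, mp, t)
vertices = set(inputs) | set(outputs) | {x for e in edges for x in e}
print(len(inputs), len(outputs), len(vertices), len(edges))
```

In this model the graph has $q^t$ inputs, $(mp)^t$ outputs, and $\sum_{j=0}^{t} (mp)^{t-j} q^j$ vertices in total, which is the layered structure the expansion analysis relies on.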

Lemma 5 If $Dec_1 C$ is connected, then the edge expansion of $Dec_t C$ is

$$h(Dec_t C) = \Omega\left(\left(\frac{mp}{q}\right)^t\right).$$
Proof The proof follows that of Lemma 4.9 in [5], adapting the corresponding parameters. We provide
it here for completeness. Let $G_t = (V, E)$ be $Dec_t C$, and let $S \subseteq V$, $|S| \le |V|/2$. We next show that
$|E(S, V \setminus S)| \ge c \cdot d \cdot |S| \cdot \left(\frac{mp}{q}\right)^t$, where $c$ is some universal constant, and $d$ is the constant degree of
$Dec_t C$ (after adding loops to make it regular).

The proof works as follows. Recall that $G_t$ is a layered graph (with layers corresponding to recursion
steps), so all edges (excluding loops) connect between consecutive levels of vertices. We argue (in
Proposition 9) that each level of $G_t$ contains about the same fraction of $S$ vertices, or else we have
many edges leaving $S$. We also observe (in Fact 10) that such homogeneity (of a fraction of $S$ vertices)
does not hold between distinct parts of the lowest level, or, again, we have many edges leaving $S$. We
then show that the homogeneity between levels, combined with the heterogeneity of the lowest level,
guarantees that there are many edges leaving $S$.

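Let $l_i$ be the $i$th level of vertices of $G_t$, so $(mp)^t = |l_1| < |l_2| < \cdots < |l_i| = (mp)^{t-i+1} q^{i-1} < \cdots <
|l_{t+1}| = q^t$. Let $S_i = S \cap l_i$. Let $\sigma = \frac{|S|}{|V|}$ be the fractional size of $S$ and $\sigma_i = \frac{|S_i|}{|l_i|}$ be the fractional size
of $S$ at level $i$. Let $\delta_i = \sigma_i - \sigma_{i+1}$. Due to averaging, we observe the following:

Fact 6 There exist $i$ and $i'$ such that $\sigma_i \le \sigma \le \sigma_{i'}$.

Fact 7

$$|V| = \sum_{i=1}^{t+1} |l_i| = \sum_{i=1}^{t+1} |l_{t+1}| \cdot \left(\frac{mp}{q}\right)^{t+1-i} = |l_{t+1}| \cdot \left(1 - \left(\frac{mp}{q}\right)^{t+1}\right) \cdot \frac{q}{q - mp} = |l_1| \cdot \left(\frac{q}{mp}\right)^{t} \cdot \left(1 - \left(\frac{mp}{q}\right)^{t+1}\right) \cdot \frac{q}{q - mp},$$

so $\frac{q - mp}{q} \le \frac{|l_{t+1}|}{|V|} \le \frac{q - mp}{q} \cdot \frac{1}{1 - (mp/q)^{t+1}}$ and $\frac{q - mp}{q} \cdot \left(\frac{mp}{q}\right)^{t} \le \frac{|l_1|}{|V|} \le \frac{q - mp}{q} \cdot \left(\frac{mp}{q}\right)^{t} \cdot \frac{1}{1 - (mp/q)^{t+1}}$.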

Proposition 8 There exists $c' = c'(G_1)$ so that $|E(S, V \setminus S) \cap E(l_i, l_{i+1})| \ge c' \cdot d \cdot |\delta_i| \cdot |l_i|$.

Proof of Proposition 8 Let $G'$ be a $G_1$ component connecting $l_i$ with $l_{i+1}$ (so it has $mp$ vertices in
$l_i$ and $q$ in $l_{i+1}$). $G'$ has no edges in $E(S, V \setminus S)$ if all or none of its vertices are in $S$. Otherwise, as $G'$
is connected, it contributes at least one edge to $E(S, V \setminus S)$. The number of such $G_1$ components with
all their vertices in $S$ is at most $\min\{\sigma_i, \sigma_{i+1}\} \cdot \frac{|l_i|}{mp}$. Therefore, there are at least $|\sigma_i - \sigma_{i+1}| \cdot \frac{|l_i|}{mp}$ $G_1$
components with at least one vertex in $S$ and one vertex that is not. □



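Proposition 9 (Homogeneity between levels) If there exists $i$ so that $\frac{|\sigma - \sigma_i|}{\sigma} \ge \frac{1}{10}$, then

$$|E(S, V \setminus S)| \ge c \cdot d \cdot |S| \cdot \left(\frac{mp}{q}\right)^t,$$

where $c > 0$ is some constant depending on $G_1$ only.

Proof of Proposition 9 Assume that there exists $j$ so that $\frac{|\sigma - \sigma_j|}{\sigma} \ge \frac{1}{10}$. By Proposition 8 we have

$$|E(S, V \setminus S)| \ge \sum_{i \in [t]} |E(S, V \setminus S) \cap E(l_i, l_{i+1})|$$
$$\ge \sum_{i \in [t]} c' \cdot d \cdot |\delta_i| \cdot |l_i|$$
$$\ge c' \cdot d \cdot |l_1| \sum_{i \in [t]} |\delta_i|$$
$$\ge c' \cdot d \cdot |l_1| \cdot \left( \max_{i \in [t+1]} \sigma_i - \min_{i \in [t+1]} \sigma_i \right).$$

By the initial assumption, there exists $j$ so that $\frac{|\sigma - \sigma_j|}{\sigma} \ge \frac{1}{10}$, therefore $\max_i \sigma_i - \min_i \sigma_i \ge \frac{\sigma}{10}$, then

$$|E(S, V \setminus S)| \ge c' \cdot d \cdot |l_1| \cdot \frac{\sigma}{10}.$$

By Fact 7, $|l_1| \ge \frac{q - mp}{q} \cdot \left(\frac{mp}{q}\right)^t \cdot |V|$, so this is

$$\ge c' \cdot d \cdot \frac{\sigma}{10} \cdot \frac{q - mp}{q} \cdot \left(\frac{mp}{q}\right)^t \cdot |V|.$$

As $|S| = \sigma \cdot |V|$, this is

$$\ge c \cdot d \cdot |S| \cdot \left(\frac{mp}{q}\right)^t$$

for any $c \le \frac{c'}{10} \cdot \frac{q - mp}{q}$. □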

Let $T_t$ be a tree corresponding to the recursive construction of $G_t$ in the following way: $T_t$ is a tree of
height $t + 1$, where each internal node has $mp$ children. The root $r$ of $T_t$ corresponds to $l_{t+1}$ (the largest
level of $G_t$). The $mp$ children of $r$ correspond to the largest levels of the $mp$ graphs that one can obtain
by removing the level of vertices $l_{t+1}$ from $G_t$. And so on. For every node $u$ of $T_t$, denote by $V_u$ the set
of vertices in $G_t$ corresponding to $u$. We thus have $|V_r| = q^t$ where $r$ is the root of $T_t$, $|V_u| = q^{t-1}$ for
each node $u$ that is a child of $r$; and in general we have $(mp)^i$ tree nodes $u$ at depth $i$, each corresponding
to a set of size $|V_u| = q^{t-i}$. Each leaf $l$ corresponds to a set of size 1.

For a tree node $u$, let us define $\rho_u = \frac{|S \cap V_u|}{|V_u|}$ to be the fraction of $S$ nodes in $V_u$, and $\delta_u = |\rho_u - \rho_{p(u)}|$,
where $p(u)$ is the parent of $u$ (for the root $r$ we let $p(r) = r$). We let $t_i$ be the $i$th level of $T_t$, counting
from the bottom, so $t_{t+1}$ is the root and $t_1$ are the leaves.

Fact 10 As $V_r = l_{t+1}$ we have $\rho_r = \sigma_{t+1}$. For a tree leaf $u \in t_1$, we have $|V_u| = 1$. Therefore
$\rho_u \in \{0, 1\}$. The number of vertices $u$ in $t_1$ with $\rho_u = 1$ is $\sigma_1 \cdot |l_1|$.

Proposition 11 Let $u_0$ be an internal tree node, and let $u_1, u_2, \ldots, u_{mp}$ be its $mp$ children. Then

$$\sum_{i \in [mp]} |E(S, V \setminus S) \cap E(V_{u_i}, V_{u_0})| \ge c'' \cdot d \cdot \sum_{i \in [mp]} |\rho_{u_i} - \rho_{u_0}| \cdot |V_{u_i}|,$$

where $c'' = c''(G_1)$.
Proof of Proposition 11 The proof follows that of Proposition 8. Let $G'$ be a $G_1$ component
connecting $V_{u_0}$ with $\bigcup_{i \in [mp]} V_{u_i}$ (so it has $q$ vertices in $V_{u_0}$ and one in each of $V_{u_1}, V_{u_2}, \ldots, V_{u_{mp}}$). $G'$
has no edges in $E(S, V \setminus S)$ if all or none of its vertices are in $S$. Otherwise, as $G'$ is connected, it
contributes at least one edge to $E(S, V \setminus S)$. The number of $G_1$ components with all their vertices in $S$
is at most $\min\{\rho_{u_0}, \rho_{u_1}, \rho_{u_2}, \ldots, \rho_{u_{mp}}\} \cdot |V_{u_1}|$. Therefore, there are at least
$\max_{i \in [mp]} \{|\rho_{u_0} - \rho_{u_i}|\} \cdot |V_{u_1}| \ge \frac{1}{mp} \cdot \sum_{i \in [mp]} |\rho_{u_0} - \rho_{u_i}| \cdot |V_{u_1}|$ $G_1$ components with at least one
vertex in $S$ and one vertex that is not. □



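We have

$$|E(S, V \setminus S)| = \sum_{u \in T_t} |E(S, V \setminus S) \cap E(V_u, V_{p(u)})|.$$

By Proposition 11 this is

$$\ge \sum_{u \in T_t} c'' \cdot d \cdot |\rho_u - \rho_{p(u)}| \cdot |V_u|$$
$$= c'' \cdot d \cdot \sum_{i \in [t]} \sum_{u \in t_i} |\rho_u - \rho_{p(u)}| \cdot q^{i-1}$$
$$\ge c'' \cdot d \cdot \sum_{i \in [t]} \sum_{u \in t_i} |\rho_u - \rho_{p(u)}| \cdot (mp)^{i-1}.$$

As each internal node has $mp$ children, this is

$$= c'' \cdot d \cdot \sum_{v \in t_1} \sum_{u \in v \sim r} |\rho_u - \rho_{p(u)}|,$$

where $v \sim r$ is the path from $v$ to the root $r$. By the triangle inequality for the function $|\cdot|$, this is

$$\ge c'' \cdot d \cdot \sum_{v \in t_1} |\rho_v - \rho_r|.$$

By Fact 10,

$$\ge c'' \cdot d \cdot |l_1| \cdot \left( (1 - \sigma_1) \cdot \rho_r + \sigma_1 \cdot (1 - \rho_r) \right).$$

By Proposition 9, w.l.o.g., $|\sigma_{t+1} - \sigma| / \sigma \le \frac{1}{10}$ and $|\sigma_1 - \sigma| / \sigma \le \frac{1}{10}$. As $\rho_r = \sigma_{t+1}$,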



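$$\ge \frac{1}{2} \cdot c'' \cdot d \cdot |l_1| \cdot \sigma,$$

and by Fact 7,

$$\ge c \cdot d \cdot |S| \cdot \left(\frac{mp}{q}\right)^t$$

for any $c \le \frac{1}{2} \cdot c'' \cdot \frac{q - mp}{q}$. □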
Using Lemma 2.1 of [5] (decomposition into edge disjoint small subgraphs) we deduce that for
sufficiently large $t$,

$$h_s(Dec_t C) = \Omega\left(\left(\frac{mp}{q}\right)^{\log_q s}\right).$$

Thus there exists a constant $c$ such that for $s = c M^{\log_{mp} q}$, $s \cdot h_s(Dec_t C) \ge 3M$. Plugging this into
inequality (3) we obtain Theorem 1.

4.2 Stretching a segment 

We next consider the case where all vertices have a degree bounded by $O(t)$. We analyze the edge
expansion of the relaxed computational graph,$^5$ which corresponds to the same set of computations
but has a constant degree bound. We then show that an augmented partition argument (similar to
that in Section 3.1) results in a communication cost lower bound which is optimal up to at most a
polylogarithmic factor.

Since a relaxed encoding graph has a constant degree bound we can analyze the expansion of the
$Enc_t A$ and $Enc_t B$ parts of the computational graph by exactly the same technique used for $Dec_t C$
above. Plugging in the corresponding parameters, we thus obtain:

Lemma 12 Let $G'_t$ be the relaxed computational graph of computing $\langle m^t,n^t,p^t \rangle = q^t$ based on $\langle m,n,p \rangle = q$.
Let $Enc'_t A$ and $Enc'_t B$ be the subgraphs corresponding to the encoding of $A$ and $B$ in $G'_t$. Then

$$h_s(Enc'_t A) = \Omega\left(\left(\frac{mn}{q}\right)^{\log_q s}\right) \quad \text{and} \quad h_s(Enc'_t B) = \Omega\left(\left(\frac{np}{q}\right)^{\log_q s}\right).$$

Consider a CDAG $G$ with maximum degree $O(t)$ and its corresponding relaxed CDAG $G'$ of constant
degree. Given the expansion of $G'$ we would like to deduce the communication cost incurred by computing
$G$. To this end we need amended versions of inequalities (2) and (3); since by transforming $G'$ back
to $G$, $|R_S| + |W_S|$ may contract by a factor of $O(t)$, we need to compensate for that by increasing the
segment size $s$. To be precise, we want $\frac{h_s(Enc'_t A) \cdot s}{O(t)} - 2M \ge M$. Following inequality (2), we thus choose
the minimal $s$ such that $h_s(Enc'_t A) \cdot s \ge c' t M$, where $c'$ is some universal constant. By inequality (3)
and Lemma 12, $\left(\frac{mn}{q}\right)^{\log_q s} \cdot s = \Theta(tM)$, so

$$W = \Omega\left(\frac{q^t}{(tM)^{\log_{mn} q}} \cdot M\right),$$

and Theorem 2 follows.



4.3 Disconnected encoding or decoding graphs 

The CDAG of any fast (rectangular or square) matrix multiplication algorithm must be connected, due
to the dependencies of the output entries on the input entries. The encoding and decoding graphs,
however, are not always connected (see e.g., Bini's algorithm, in Section 5.1 and Appendix A). Consider
a case where each connected component of $Dec_t C$ is small enough to fit into the fast memory. Then
our proof technique cannot provide a nontrivial lower bound. Even if a connected component is larger
than $M$, but has $< M$ inputs and $< M$ outputs, the partition into segments approach provides no
communication cost lower bound (see inequality (1) and its proof). In the case that the inputs of an
encoding graph or the output of the decoding graph do not fit into fast memory, and the disconnected
components all have the same number of input and output vertices, the lower bound technique still
applies. Formally,

Corollary 13 If the base-case decoding graph is disconnected and consists of $X$ connected components
of equal input and output size, then $W = \Omega\left(\frac{q^t}{M^{\log_{mp/X}(q/X) - 1}}\right)$.



5 See Section [2] for a formal definition. 
Proof Since $Dec_t C$ is disconnected, $h(Dec_t C) = 0$. However, it consists of $X^t$ connected components,
each of which has nonzero expansion; therefore the entire graph does have expansion for small sets. Each
connected component is recursively constructed from a base graph with $q/X$ inputs and $mp/X$ outputs.
By Lemma 5, each connected component $CC_t$ of $Dec_t C$ has expansion

$h(CC_t) = \Omega\left( \left( \frac{mp}{q} \right)^t \right)$.

In order to apply Lemma 2.1 of [5] (decomposition into edge-disjoint small subgraphs), we decompose
$Dec_t C$ into connected components of size $s$, where $s$ needs to satisfy two conditions. First, $s$ must be
smaller than the size of the connected components of $Dec_t C$ (otherwise we cannot claim any expansion),
namely

$s = O\left( (q/X)^t \right)$.

Second, $s$ must be large enough so that the output of one component does not fit into fast memory
(otherwise the expansion guarantee does not translate into a communication lower bound):

$\left( \frac{mp}{X} \right)^k = \Omega(M)$,

where $k = \log_{q/X} s$ is the number of recursive steps inside one component. We then deduce that

$h_s(Dec_t C) = \Omega\left( \left( \frac{mp}{q} \right)^{\log_{q/X} s} \right)$.

Thus there exists a constant $c$ such that for $s = c M^{\log_{mp/X}(q/X)}$, $s \cdot h_s(Dec_t C) \ge 3M$. Plugging this
into inequality (3) we obtain Corollary 13.

Note that in the case that $M = \Omega\left( (mp/X)^t \right)$ the argument above does not apply, but the result still
holds because it is weaker than the trivial bound that the entire output must be written:
$W = \Omega\left( (mp)^t \right)$.
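The choice of $s$ in the proof can be checked by direct substitution. The following is a sketch only: it
assumes the readings $s = c M^{\log_{mp/X}(q/X)}$, $k = \log_{q/X} s$ and $h_s(Dec_t C) = \Omega((mp/q)^k)$, and it
suppresses all constants.

```latex
\begin{align*}
k = \log_{q/X} s
  &= \log_{mp/X}(q/X) \cdot \log_{q/X} M = \log_{mp/X} M
  \quad\Longrightarrow\quad \left(\frac{mp}{X}\right)^{k} = M, \\
s \cdot h_s(Dec_t C)
  &\approx \left(\frac{q}{X}\right)^{k} \left(\frac{mp/X}{q/X}\right)^{k}
   = \left(\frac{mp}{X}\right)^{k} = M, \\
\left.\frac{q^t}{M^{\log_{mp/X}(q/X) - 1}}\right|_{M = (mp/X)^t}
  &= \frac{q^t \, (mp/X)^t}{(q/X)^t} = (mp)^t .
\end{align*}
```

The first line recovers the second condition on $s$, the second line shows that $s \cdot h_s$ is of order $M$, and
the third line shows that at the threshold $M = (mp/X)^t$ the stated bound coincides with the trivial bound
$(mp)^t$, so for larger $M$ it is weaker.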



Corollary 14 If a base-case encoding graph is disconnected and consists of $X$ connected components
of equal input and output size, has $N$ inputs, where $N = mn$ or $N = np$, and has no multiply-copied
inputs, then $W = \Omega\left( \frac{q^t}{t^{\log_{N/X}(q/X)} \, M^{\log_{N/X}(q/X) - 1}} \right)$.

Proof Let $G'_t$ be the relaxed computational graph of computing $(m^t, n^t, p^t) = q^t$ based on $(m, n, p) = q$.
Let $Enc'_t$ be the subgraph corresponding to the encoding of $A$ or $B$ in $G'_t$, and $N$ be $mn$ (for the encoding
of $A$) or $np$ (for the encoding of $B$). Then by the same argument as above,

$h_s(Enc'_t) = \Omega\left( \left( \frac{N}{q} \right)^{\log_{q/X} s} \right)$.

Since by transforming $G'$ back to $G$ the sum $|R_S| + |W_S|$ may contract by a factor of $O(t)$ (recall
Section 4.2), we need to compensate for that by increasing the segment size $s$. Thus the above only
holds for

$\left( \frac{N}{X} \right)^k = \Omega(Mt)$,

where $k = \log_{q/X} s$. It follows that there exists a constant $c$ such that for $s = c (tM)^{\log_{N/X}(q/X)}$,
$s \cdot h_s(Enc'_t) \ge 3tM$. Plugging this into inequality (3) we obtain Corollary 14. Note that in the case that
$M = \Omega\left( (N/X)^t \right)$, the argument above does not apply, but the result still holds because it is weaker than
the trivial bound that the entire input must be read: $W = \Omega(N^t)$.
Figure 2: Computational graph for 1 level of Bini's (3,2,2) = 10 algorithm. Solid lines indicate
dependencies of additions and make up $Enc_1 A$, $Enc_1 B$, and $Dec_1 C$. Dashed lines indicate dependencies of
multiplications and connect these three subgraphs. Note that $Enc_1 A$, the bottom-left part of the graph,
is disconnected and has two connected components of equal size and equal input/output ratio. Note that
the base-case graph of Bini's algorithm is presented, for simplicity, with vertices of in-degree larger than
two. A vertex of degree larger than two, in fact, represents a full binary (not necessarily balanced) tree.
The expansion arguments hold for any way of drawing the binary trees.

5 The Communication Costs of Some Rectangular Matrix Multiplication Algorithms

In this section we apply our main results to get new lower bounds for rectangular algorithms based on
Bini's algorithm [11] and the Hopcroft-Kerr algorithm [19]. All rectangular algorithms yield a square
algorithm. In the case of Bini the exponent is $\omega_0 \approx 2.779$, slightly better than Strassen's algorithm
($\omega_0 \approx 2.807$), and in the case of Hopcroft-Kerr the exponent is $\omega_0 \approx 2.811$, slightly worse than Strassen's
algorithm. These algorithms are stated explicitly, which is not true of most of the recent results that
significantly improve $\omega$. See Table 1 for an enumeration of several algorithms based on [11, 19] and
their lower bounds.
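The quoted exponents can be reproduced from the base cases. Symmetrizing an $(m, n, p) = q$ algorithm
over its three shapes gives a square algorithm $(mnp, mnp, mnp) = q^3$, so $\omega_0 = 3 \log_{mnp} q$. The sketch
below evaluates this formula; the function name `square_exponent` is introduced here for illustration
and does not appear in the paper.

```python
import math


def square_exponent(m, n, p, q):
    """Exponent of the square algorithm derived from an (m, n, p) = q base case."""
    # (mnp, mnp, mnp) = q^3, hence omega_0 = log_{mnp}(q^3) = 3 log(q) / log(mnp).
    return 3 * math.log(q) / math.log(m * n * p)


base_cases = {
    "Strassen (2,2,2) = 7": (2, 2, 2, 7),
    "Bini (3,2,2) = 10": (3, 2, 2, 10),
    "Hopcroft-Kerr (3,2,3) = 15": (3, 2, 3, 15),
}
for name, (m, n, p, q) in base_cases.items():
    print(f"{name}: omega_0 = {square_exponent(m, n, p, q):.4f}")
```

The Bini and Hopcroft-Kerr values agree with the square cases (12, 12, 12) = 1000 and
(18, 18, 18) = 3375 used later in this section.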

5.1 Bini's algorithm 

Bini et al. [11] obtained the first approximate matrix multiplication algorithm. They introduce a
parameter $\lambda$ into the computation and give an algorithm that computes matrix multiplication up to terms of
order $\lambda$. It was later shown how to convert such approximate algorithms into exact algorithms without
changing the asymptotic arithmetic complexity, ignoring logarithmic factors [10].$^6$

Bini et al. show how to compute 2 × 2 × 2 matrix multiplication approximately, where one of the
off-diagonal entries of an input matrix is zero, using 5 scalar multiplications. This can be used twice
to give an algorithm for (3, 2, 2) = 10 matrix multiplication. Notably, this algorithm has disconnected
$Enc_1 A$ (see Figure 2).

From this (3, 2, 2) = 10 algorithm one immediately obtains 5 more algorithms by transposition and
interchanging the encoding and decoding graphs [18]. Other algorithms can be constructed by taking
tensor products of these base cases. When taking tensor products, the number of connected components
of each encoding and decoding graph is the product of the number of connected components in the base
cases. For example, there are 4 ways to construct algorithms for (6,6,4) = 100: one where $Enc_1 A$ and
$Enc_1 B$ each have two components, one where $Enc_1 A$ and $Dec_1 C$ each have two components, one where
$Enc_1 B$ and $Dec_1 C$ each have two components, and one where $Enc_1 A$ has four components. Similarly,
there are 8 ways to construct algorithms for the square multiplication (12, 12, 12) = 1000.
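The counts of 4 and 8 can be reproduced by enumerating tensor products of the six base cases. The
sketch below is illustrative only. It assumes that each base case is identified by its shape together
with the single graph that has two components, and that this graph always belongs to one of the two
six-entry matrices of that shape; the names `BASE_CASES` and `constructions` are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod

# Assumed labeling of the six base cases: (shape, graph with two components).
BASE_CASES = [
    ((3, 2, 2), "EncA"), ((3, 2, 2), "DecC"),
    ((2, 3, 2), "EncA"), ((2, 3, 2), "EncB"),
    ((2, 2, 3), "EncB"), ((2, 2, 3), "DecC"),
]


def constructions(target_shape, num_factors):
    """List tensor products of num_factors base cases that have target_shape.

    Each entry maps a graph name to its number of connected components,
    using the rule that component counts multiply under tensor products.
    """
    results = []
    for combo in combinations_with_replacement(BASE_CASES, num_factors):
        shape = tuple(prod(case[0][i] for case in combo) for i in range(3))
        if shape != target_shape:
            continue
        components = {"EncA": 1, "EncB": 1, "DecC": 1}
        for _, graph in combo:
            components[graph] *= 2  # each factor doubles one graph's count
        results.append(components)
    return results


for components in constructions((6, 6, 4), 2):
    print(components)
print(len(constructions((12, 12, 12), 3)))
```

Unordered combinations are used because the order of the tensor factors does not change the shape
or the component counts.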

6 We treat here the original, approximate algorithm, not any of the exact algorithms that can be derived from it. 
Table 1: Asymptotic lower bounds for several variants of the algorithms by Bini et al. and
Hopcroft-Kerr. Many more with different shapes and with different disconnected subgraphs can be given for Bini's
algorithm, and analyzed by similar means; we list only a representative sample. Recall that the base case
$(m, n, p) = q$ is used for the computation of $(m^t, n^t, p^t) = q^t$.

5.2 The Hopcroft-Kerr algorithm 

Hopcroft and Kerr [19] provide an algorithm for (3, 2, 3) = 15, and prove that fewer than 15 scalar
multiplications is not possible. In their algorithm, all the encoding and decoding graphs are
connected. Thus, only Theorems 1 and 2 are necessary for proving the lower bounds. For the square
case (18, 18, 18) = 3375, Theorem 1 reproduces the result of [5].

6 Discussion and Open Problems 

Using graph expansion analysis we obtain tight lower bounds on recursive rectangular matrix multipli- 
cation algorithms in the case that the output matrix is at least as large as the input matrices, and the 
decoding graph is connected. We also obtain a similar bound in the case that the encoding graph of the 
largest matrix is connected, which is tight up to a factor that is polylogarithmic in the input, assuming 
no multiply copied inputs. Finally we extend these bounds to some disconnected cases, with restrictions 
on the fast memory size. Whenever the decoding graph is not the largest of the three subgraphs (equiva- 
lently, whenever the output matrix is smaller than one of the input matrices), or when the largest graph 
is disconnected, our bounds are not tight. 

6.1 Limitations of the lower bounds. 

There are several cases when our lower bounds do not apply. These are cases where the full algorithm 
is a hybrid of several base algorithms combined in an arbitrary sequence. Consider the case where two 
base algorithms are applied recursively. If the recursion alternates between them, our lower bounds 
apply to the tensor product of the two base cases, which can be thought of as taking two recursive 
steps at once. However, for cases of arbitrary choice of which base case to apply at each recursive step, 
we do not provide communication cost lower bounds. The technical difficulty in extending our results
in this case lies in generalizing the recursive construction of the decoding graph given in Section 4.1.1.



Similarly, if the base-case decoding (or encoding) graph is disconnected and contains several connected 
components of different sizes, our bounds do not apply. In this case the connected components of the 
entire decoding (or encoding) graph are constructed out of all possible interleavings of the different 
connected components. Finally, the lower bounds do not apply to algorithms that are not recursive, 
including approximate algorithms that are not bilinear. 



6.2 Parallel case. 

Although our main focus is on the sequential case, we note that the sequential communication bounds 
presented here can be generalized to communication bounds in the distributed-memory parallel model 
of [3]. The lower bound proof technique here can be extended to obtain both memory-dependent and 
memory-independent parallel bounds as in [2]. Further, the Communication Avoiding Parallel Strassen
(CAPS) algorithm presented in [3] is shown to be communication-optimal and faster (both theoretically 
and empirically) than previous attempts to parallelize Strassen's algorithm [5]. The parallelization 
approach of CAPS is general, and in particular it can be applied to rectangular matrix multiplication, 
giving a communication upper bound which matches the lower bounds in the same circumstances as in 
the sequential case. 



6.3 Blackbox use of fast square matrix multiplication algorithms. 

Instead of using a fast rectangular matrix multiplication algorithm, one can perform rectangular matrix
multiplication of the form $(m^t, n^t, p^t)$ with fewer than the naive number of $(mnp)^t$ multiplications by
blackbox use of a square matrix multiplication algorithm with exponent $\omega_0$ (that is, an algorithm for
multiplying $n \times n$ matrices with $O(n^{\omega_0})$ flops). The idea is to break up the original problem into
$\left( \frac{m^t}{n^t} \right) \cdot \left( \frac{p^t}{n^t} \right)$ square matrix multiplication problems of size $(n^t) \times (n^t)$. The arithmetic cost of such a
blackbox algorithm is $\Theta\left( (mpn^{\omega_0 - 2})^t \right)$. Using the upper and lower bounds in [3], the communication cost
is $\Theta\left( \frac{(mpn^{\omega_0 - 2})^t}{M^{\omega_0/2 - 1}} \right)$.

We note that, in some cases, blackbox use of a square algorithm may give a lower communication
cost than a rectangular algorithm, even if it has a higher arithmetic cost. In particular, if $q < mpn^{\omega_0 - 2}$,
then the rectangular algorithm performs asymptotically fewer flops. It is possible to have simultaneously
$\omega_0/2 > \log_{mp} q$, meaning that for certain values of $M$ and $t$ the communication cost of the rectangular
algorithm is higher. On some machines, the arithmetically slower algorithm may require less total time
if the communication cost dominates.
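The two conditions can be evaluated numerically for a concrete pairing. The sketch below checks them
for the (3, 2, 2) = 10 base case against a blackbox square algorithm with Strassen's exponent. That
pairing is chosen here purely for illustration and is not analyzed in the text, and the function name
`compare_with_blackbox` is hypothetical.

```python
import math


def compare_with_blackbox(m, n, p, q, omega0):
    """Evaluate the two conditions for an (m, n, p) = q base case against a
    blackbox square algorithm with exponent omega0.

    Returns (rectangular_fewer_flops, blackbox_larger_memory_exponent).
    """
    # Per-level growth of the blackbox flop count: m * p * n^(omega0 - 2).
    blackbox_growth = m * p * n ** (omega0 - 2)
    rectangular_fewer_flops = q < blackbox_growth
    # Compare omega0 / 2 with log_{mp} q, the exponents named in the text.
    blackbox_larger_memory_exponent = omega0 / 2 > math.log(q, m * p)
    return rectangular_fewer_flops, blackbox_larger_memory_exponent


strassen_exponent = math.log2(7)
print(compare_with_blackbox(3, 2, 2, 10, strassen_exponent))
```

When both conditions hold, the rectangular algorithm does fewer flops while the blackbox approach
may still move fewer words for some values of $M$ and $t$. The sketch compares only the exponents and
ignores the constants hidden in the asymptotic bounds.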
A Details of Bini's and the Hopcroft-Kerr algorithm

In this appendix we give the details of Bini's algorithm [11] and the Hopcroft-Kerr algorithm [19]. We
provide these for completeness.

We express an algorithm for $(m, n, p) = q$ matrix multiplication by giving the three adjacency
matrices of the encoding and decoding graphs: $U$ of dimension $mn \times q$, $V$ of dimension $np \times q$, and $W$
of dimension $mp \times q$. The rows of $U$, $V$, and $W$ correspond to the entries of $A$, $B$, and $C$, respectively,
in row-major order. The columns correspond to the $q$ multiplications. To be precise, each column of
$U$ specifies a linear combination of entries of $A$, and each column of $V$ specifies a linear combination of
entries of $B$. These two linear combinations are to be multiplied together, and then the corresponding
column of $W$ specifies to which entries of $C$ that product contributes, and with what coefficient.$^8$
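To make the notation concrete, the following sketch evaluates an algorithm given as a triple
$(U, V, W)$. It uses the classical $(m, n, p) = mnp$ algorithm as a stand-in, since its matrices are easy to
generate. These are not the matrices of Bini's or the Hopcroft-Kerr algorithm, and the helper names
`classical_uvw` and `apply_bilinear` are hypothetical.

```python
def classical_uvw(m, n, p):
    """Build U, V, W for the classical (m, n, p) = mnp algorithm."""
    q = m * n * p
    U = [[0] * q for _ in range(m * n)]
    V = [[0] * q for _ in range(n * p)]
    W = [[0] * q for _ in range(m * p)]
    col = 0
    for i in range(m):
        for j in range(n):
            for k in range(p):
                U[i * n + j][col] = 1  # left factor is A[i][j] (row-major index)
                V[j * p + k][col] = 1  # right factor is B[j][k]
                W[i * p + k][col] = 1  # the product contributes to C[i][k]
                col += 1
    return U, V, W


def apply_bilinear(U, V, W, a, b):
    """Compute the row-major entries of C from row-major a and b."""
    q = len(U[0])
    products = []
    for r in range(q):
        left = sum(U[i][r] * a[i] for i in range(len(a)))   # column r of U
        right = sum(V[j][r] * b[j] for j in range(len(b)))  # column r of V
        products.append(left * right)
    # Row k of W collects the products that contribute to entry k of C.
    return [sum(W[k][r] * products[r] for r in range(q)) for k in range(len(W))]


U, V, W = classical_uvw(3, 2, 2)
a = [1, 2, 3, 4, 5, 6]  # A is 3 x 2
b = [7, 8, 9, 10]       # B is 2 x 2
print(apply_bilinear(U, V, W, a, b))
```

Any exact bilinear algorithm in this notation can be passed to `apply_bilinear` unchanged; only the
entries and the number of columns $q$ differ.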



A.1 Bini's algorithm

We provide all 6 base cases for Bini's algorithm that appear in Section 5.1. They are labeled by the
shape of the multiplication and which graph is disconnected. The first algorithm is:
8 The sparsity of the matrices in this notation corresponds loosely to the number of additions and subtractions, but this
notation is not sufficient to specify the leading constant hidden in the computational costs. In particular, this notation does
not show the advantage of Winograd's variant of Strassen's algorithm [15] over Strassen's original formulation [33].
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The remaining 5 algorithms can be concisely expressed in terms of the rows of the first algorithm: 
A.2 The Hopcroft-Kerr algorithm

For the Hopcroft-Kerr algorithm we give only 3 of the 6 base cases, since all the graphs are connected. 
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